By Derek Lilienthal

About the data


This data was scraped from Dice.com between July 2021 and August 2021. In total, I captured more than 800 ‘Data Science’ listings, which I scraped myself using Python. I defined word banks with explicit matching rules to capture things like skills, programming languages, technologies, methodologies, educational level, and years of experience mentioned in each ad. Every posting in the dataset is technology-related, because Dice.com focuses on jobs in technology fields. Because the dataset covers only about a 30-day window on a single, relatively small job board, it is merely a small sample of what job ads commonly mention, drawn from just one website.
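As a rough illustration of the word-bank idea described above, here is a minimal sketch in Python. The bank contents, the `find_terms` helper, and the sample ad text are all hypothetical; the rules actually used to build the dataset live in the scraper and pre-processing code and may differ.

```python
import re

# Hypothetical word bank; the banks used for the real dataset are not shown here.
LANGUAGE_BANK = ["Python", "R", "SQL", "Java", "JavaScript", "C++"]

def find_terms(text, bank):
    """Return the bank terms that appear in text as whole tokens, in bank order."""
    found = []
    for term in bank:
        # Lookarounds instead of \b so that "C++" can match and so that
        # "Java" is not reported just because "JavaScript" is present.
        pattern = r"(?<![A-Za-z0-9+#])" + re.escape(term) + r"(?![A-Za-z0-9+#])"
        # Single-letter terms such as "R" stay case-sensitive to limit false hits.
        flags = 0 if len(term) == 1 else re.IGNORECASE
        if re.search(pattern, text, flags):
            found.append(term)
    return found

ad = "We need strong Python and SQL skills; JavaScript or C++ is a plus."
print(find_terms(ad, LANGUAGE_BANK))
```

Explicit rules like the case-sensitive check for single-letter terms are the kind of special-casing a word bank tends to need, since short names collide easily with ordinary words.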

Captured features

  • job_listing: Name of the job captured from the title of job listing
  • original_date_posted: The date the job was originally posted on dice.com
  • skills: Skills required for the job listed in the header of the job listing
  • company: The company posting the job listed in the header of the job listing
  • job_city: The city where the job is located
  • job_region: The state where the job is located
  • job_postal_code: The postal code where the job is located
  • intext_job_title: The title of the job, as mentioned in the listing text itself
  • intext_company_name: The name of the company posting the job, as mentioned in the listing text itself
  • intext_date_posted: The date the job was last posted or most recently reposted, as mentioned in the text itself
  • intext_location: The location of the job, as mentioned in the text itself
  • education_level: The range of education levels required for the role, as mentioned in the job listing itself
  • intext_skills: The skills mentioned in the text itself required for the job
  • codeing_languages: Programming languages mentioned in the text itself
  • technologies: The technologies mentioned in the text itself
  • methodologies: The methodologies mentioned in the text itself
  • operating_systems: The operating systems mentioned in the text itself
  • remote: Whether the job is remote or mentions the possibility of remote work
  • years_experience: The range of years of experience required for the job, mentioned in the text itself
  • date_of_processing: The date on which I scraped the listing from Dice.com
  • salary: The salary mentioned in the job listing (never implemented, because salaries are mentioned too rarely)
  • URL: The exact URL from which I got the listing
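
To make a field like years_experience more concrete, the sketch below shows one way a range could be pulled out of listing text. The regular expression, the `years_range` helper, and the choice to span the smallest and largest numbers found are assumptions for illustration, not the rules used to build the dataset.

```python
import re

# Illustrative pattern: matches "3-5 years", "3 to 5 years", "2+ years", "4 yrs".
YEARS_RE = re.compile(
    r"\b(\d{1,2})(?:\s*(?:-|to)\s*(\d{1,2}))?\s*\+?\s*(?:years?|yrs?)\b",
    re.IGNORECASE,
)

def years_range(text):
    """Summarize every years-of-experience mention in text as 'low to high'."""
    lows, highs = [], []
    for match in YEARS_RE.finditer(text):
        low = int(match.group(1))
        # A single number such as "2+ years" counts as both low and high.
        high = int(match.group(2)) if match.group(2) else low
        lows.append(low)
        highs.append(high)
    if not lows:
        return "NA"  # no mention found, mirroring the NA values in the dataset
    low, high = min(lows), max(highs)
    return str(low) if low == high else f"{low} to {high}"

print(years_range("3+ years of Python and 5 years of SQL"))
```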

Snapshot of the dataset

job_listing | skills | years_experience | company_name
Senior Data Scientist | Artificial Intelligence, Python, IT, SAS, SQL, PowerPoint, Foundation | 3 to 5 | New York Life Insurance Company
Data Scientist | Research, Computer, Programming, Python, Java, SQL, JavaScript, HTTP, SSL, Access | 2 | comScore
Data Scientist | Data, collect, clean, analyze | 3 to 5 | University Of Delaware
Principal Data Scientist - Search | Algorithms, Engineers, Python, Java, Data Mining, Computer, Research | 3 to 7 | Walmart
Data Scientist - Entry Level | Laboratory, Security, Applications, Java, Python, Matlab, Linux, UNIX, Windows | NA | Lawrence Livermore National Laboratory

Data Cleaning

Getting the data into a form that could be tabulated took a lot of pre-processing. Mainly, scripts had to be written to tabulate things like skills, methodologies, and ranges for years of experience. Because I am more fluent in Python than R, I did most of the heavy pre-processing in Python, using libraries I know well, like Pandas and NumPy. I did, however, do some pre-processing in R as well.
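
For a sense of what tabulating skills involves, here is a minimal sketch that uses only the Python standard library rather than Pandas. The `tabulate_skills` helper is hypothetical, and lowercasing plus counting each skill once per ad are assumptions; the sample rows are the skills values from the snapshot.

```python
from collections import Counter

# Skills cells taken from the dataset snapshot, one string per ad.
rows = [
    "Artificial Intelligence, Python, IT, SAS, SQL, PowerPoint, Foundation",
    "Research, Computer, Programming, Python, Java, SQL, JavaScript, HTTP, SSL, Access",
    "Data, collect, clean, analyze",
]

def tabulate_skills(cells):
    """Count how many ads mention each skill (each skill counted once per ad)."""
    counts = Counter()
    for cell in cells:
        if not cell:
            continue  # skip missing or empty cells
        # A set removes duplicates within a single ad before counting.
        skills = {part.strip().lower() for part in cell.split(",") if part.strip()}
        counts.update(skills)
    return counts

counts = tabulate_skills(rows)
print(counts["python"], counts["sql"], counts["java"])
```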

Link to Python code used to pre-process the dataset: https://github.com/dblilienthal/Who-What-When-and-Where-are-the-Data-Scientist-Jobs/blob/main/Pre-processing%20the%20data.ipynb

Link to dashboard source code: https://github.com/dblilienthal/Who-What-When-and-Where-are-the-Data-Scientist-Jobs

Logical Diagram of my Web Scraper

Link to scraper: https://github.com/dblilienthal/Web-Scraping

WHO is hiring and WHEN are you ready?

A few companies have been actively recruiting.

WHO is hiring?

357 different companies posted a job ad for a data scientist role.

WHEN are you ready to be hired?

These numbers represent the range of years of experience mentioned in each ad.
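
As a hedged sketch of how such ranges could be summarized, the snippet below parses years_experience strings in the format shown in the snapshot ("3 to 5", "2", "NA") and counts ads by their minimum requirement. The `parse_range` helper and the decision to group by the lower bound are assumptions, not necessarily how the chart was built.

```python
from collections import Counter

# years_experience values copied from the dataset snapshot.
values = ["3 to 5", "2", "3 to 5", "3 to 7", "NA"]

def parse_range(value):
    """Turn '3 to 5' into (3, 5), '2' into (2, 2), and 'NA' or None into None."""
    if value is None or value.strip().upper() == "NA":
        return None
    parts = [int(part) for part in value.split("to")]
    return parts[0], parts[-1]

# Count ads by the lowest number of years they ask for, skipping NA entries.
minimums = Counter(
    parsed[0] for parsed in map(parse_range, values) if parsed is not None
)
print(sorted(minimums.items()))
```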

WHAT do you need to know?

Aside from knowledge of statistics and general programming, there are many skills that are rarely talked about but are vital to know in order to be a Data Scientist.

150 different skills mentioned

Programming Languages that are common

Methodologies you might need to know

WHERE are the jobs?

Even though many of the jobs are located in major metropolitan areas, a significant portion are offered remotely.

Cities with the most ads

Looking at where the jobs are geographically

Jobs listed as offered remotely